Batch effects refer to sources of unwanted variation that are
unrelated to and can obscure the biological factors of interest.
Handling batch effects is a necessary step to improve the
reproducibility of research studies. However, methods developed for
batch effects often rely on strong assumptions about data
characteristics and the types of batch effects.
This hands-on practice provides guidelines for batch effect
detection, management, and evaluation of method effectiveness through
visual and numerical approaches. Although the practice will illustrate
the workflow using microbiome data, the concepts and techniques
presented are broadly applicable to any type of omics data.
Case study description
Anaerobic Digestion
This study explored microbial
indicators that could improve the efficacy of the Anaerobic Digestion
(AD) bioprocess and prevent its failure (Chapleur
et al. 2016). Samples were treated with two different ranges of
phenol concentrations (effect of interest) and processed on five
different dates (batch effect). This study exhibits a clear and strong
batch effect with an approx. balanced batch x treatment design.
Package installation and loading
First, let’s load the packages necessary for the analysis, and check
the version of each package.
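The loading step itself is not shown; below is a minimal sketch inferred from the functions used later in this workshop (the exact package set may differ from the original environment):

```r
# packages assumed installed from CRAN/Bioconductor;
# check versions afterwards with sessionInfo()
library(PLSDAbatch)                # batch effect detection and correction
library(mixOmics)                  # PCA, PLS-DA, CLR transformation
library(TreeSummarizedExperiment)  # container for the AD data
library(vegan)                     # pRDA via varpart()
library(pheatmap)                  # heatmaps
library(sva)                       # ComBat, SVA
library(ruv)                       # RUV4, RUVIII
library(limma)                     # removeBatchEffect()
library(performance)               # regression model diagnostics
sessionInfo()
```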
Before detecting batch effects, the data should be properly
preprocessed. The preprocessing steps may vary depending on the type of
omics data. Since the example data used here are from microbiome
studies, the steps below illustrate how to preprocess microbial count
data.
Pre-filtering
Let’s load the AD data stored internally in
PLSDAbatch R package with function data() and
extract the count data using function assays() from
TreeSummarizedExperiment R package.
# AD data
data('AD_data')
ad.count <- assays(AD_data$FullData)$Count
dim(ad.count)
[1] 75 567
The raw data include 567 OTUs and 75 samples.
Then, we will use the PreFL() function
(PLSDAbatch package) to filter the data.
# filter the data
ad.filter.res <- PreFL(data = ad.count)
ad.filter <- ad.filter.res$data.filter
# zero proportion before filtering
ad.filter.res$zero.prob
[1] 0.6328042
# zero proportion after filtering
sum(ad.filter == 0)/(nrow(ad.filter) * ncol(ad.filter))
[1] 0.3806638
After filtering, 231 OTUs remained, and the proportion of zeroes
decreased from 63% to 38%.
Note: The PreFL() function is dedicated to raw
counts only, not relative abundance data. We also recommend starting
the pre-filtering from raw counts rather than relative abundances
to mitigate compositionality issues.
In addition, don’t forget to extract the batch and treatment
information out.
# extract the metadata
ad.metadata <- rowData(AD_data$FullData)
# extract the batch info
ad.batch = factor(ad.metadata$sequencing_run_date,
levels = unique(ad.metadata$sequencing_run_date))
# extract the treatment info
ad.trt = as.factor(ad.metadata$initial_phenol_concentration.regroup)
# add names on each
names(ad.batch) <- names(ad.trt) <- rownames(ad.metadata)
Exercise 1: How many samples are there within each treatment and
batch group?
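One way to answer this with base R is to cross-tabulate the two factors:

```r
# sample counts within each treatment x batch group
table(ad.trt, ad.batch)
```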
The treatment and batch grouping information corresponds to the same
samples as the count data, which is required for downstream analysis. We
also examined the sample sizes within each treatment and batch group to
assess whether the batch × treatment design is balanced. In the
AD data, the design is approximately balanced.
Transformation
Prior to CLR transformation, we recommend adding 1 as an offset to
the count data. Here, we will use logratio.transfo()
function in mixOmics package to perform the CLR
transformation.
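The transformation call is not shown above; a sketch, assuming the filtered counts are stored in ad.filter:

```r
# CLR transformation with an offset of 1 to handle zero counts
ad.clr <- logratio.transfo(X = ad.filter, logratio = 'CLR', offset = 1)
# strip the 'clr' class so downstream functions treat it as a matrix
class(ad.clr) <- 'matrix'
```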
To visualise the data, let’s apply pca() function from
mixOmics package and Scatter_Density() function
from PLSDAbatch to represent the PCA sample plot with
densities.
# AD data
ad.pca.before <- pca(ad.clr, ncomp = 3, scale = TRUE)
Scatter_Density(object = ad.pca.before, batch = ad.batch, trt = ad.trt,
title = 'AD data', trt.legend.title = 'Phenol conc.')
Exercise 2: Interpret the PCA plot created above
Answer
In the figure above, we observed 1) a clear distinction between
samples treated with different phenol concentrations and 2) differences
between samples sequenced on “14/04/2016”, “21/09/2017” and the other
dates. Therefore, the batch effect related to sequencing dates needs to
be removed.
Boxplots and density plots
We can also visualise individual variables (OTUs in this case). For
example, we can select the top OTU driving the major variance in PCA
using selectVar() in mixOmics package. We can then
plot this OTU as boxplots and density plots using
box_plot() and density_plot() in
PLSDAbatch.
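The selection and boxplot steps are not shown; a sketch (the intermediate object names are assumptions):

```r
# select the OTU contributing most to PC1 of the earlier PCA
ad.OTU.name <- selectVar(ad.pca.before, comp = 1)$name[1]
# assemble its CLR values with the batch grouping for plotting
ad.OTU_batch <- data.frame(value = ad.clr[, ad.OTU.name],
                           batch = ad.batch)
box_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, '(AD data)'),
         x.angle = 30)
```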
density_plot(df = ad.OTU_batch, title = paste(ad.OTU.name, '(AD data)'))
The boxplot and density plot indicated a strong date effect, driven by
the differences in "OTU28" between "14/04/2016", "21/09/2017" and the
other dates.
To assess statistical significance, we can apply a linear regression
model to “OTU28” using linear_regres() from
PLSDAbatch with batch and treatment effects as covariates. To
compare “14/04/2016” and “21/09/2017” to the other batches, we need to
set them as the reference levels, respectively, using
relevel() from stats.
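The releveling and model fit are not shown above; a sketch fitting the two releveled models (the summary printed next appears to use "21/09/2017" as the reference, since that date is absent from the coefficients):

```r
# refit with '21/09/2017' as the reference batch level
ad.batch.21 <- relevel(ad.batch, ref = '21/09/2017')
ad.lm.21 <- linear_regres(data = ad.clr[, 'OTU28', drop = FALSE],
                          trt = ad.trt, batch.fix = ad.batch.21,
                          type = 'linear model')
summary(ad.lm.21$model$OTU28)
# likewise with '14/04/2016' as the reference
ad.batch.14 <- relevel(ad.batch, ref = '14/04/2016')
```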
Call:
lm(formula = data[, i] ~ trt + batch.fix)
Residuals:
Min 1Q Median 3Q Max
-1.9384 -0.3279 0.1635 0.3849 0.9887
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.1147 0.2502 8.453 2.97e-12 ***
trt1-2 -1.6871 0.1754 -9.617 2.27e-14 ***
batch.fix14/04/2016 -1.2646 0.2690 -4.701 1.28e-05 ***
batch.fix09/04/2015 0.3317 0.3139 1.056 0.29446
batch.fix01/07/2016 0.8193 0.2573 3.185 0.00218 **
batch.fix14/11/2016 0.4759 0.2705 1.760 0.08292 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7033 on 69 degrees of freedom
Multiple R-squared: 0.7546, Adjusted R-squared: 0.7368
F-statistic: 42.44 on 5 and 69 DF, p-value: < 2.2e-16
We observed P < 0.001 for the regression coefficients associated
with all other batches when “14/04/2016” was set as the reference level.
This confirms the differences between the samples from batch
“14/04/2016” and those from the other batches, as previously observed in
the plots.
Exercise 3: Interpret the result when “21/09/2017” was set as the
reference level
Answer
When “21/09/2017” was set as the reference level, we observed
significant differences between batch “21/09/2017” and “14/04/2016”, as
well as between “21/09/2017” and “01/07/2016”. These results indicate
that a batch effect associated with “21/09/2017” also exists.
Heatmap
We can also use a heatmap to visualise the data using
pheatmap package. The data first need to be scaled across both
OTUs and samples.
# scale the clr data on both OTUs and samples
ad.clr.s <- scale(ad.clr, center = TRUE, scale = TRUE)
ad.clr.ss <- scale(t(ad.clr.s), center = TRUE, scale = TRUE)
ad.anno_col <- data.frame(Batch = ad.batch, Treatment = ad.trt)
ad.anno_colors <- list(Batch = color.mixo(seq_len(5)),
Treatment = pb_color(seq_len(2)))
names(ad.anno_colors$Batch) = levels(ad.batch)
names(ad.anno_colors$Treatment) = levels(ad.trt)
pheatmap(ad.clr.ss,
cluster_rows = FALSE,
fontsize_row = 4,
fontsize_col = 6,
fontsize = 8,
clustering_distance_rows = 'euclidean',
clustering_method = 'ward.D',
treeheight_row = 30,
annotation_col = ad.anno_col,
annotation_colors = ad.anno_colors,
border_color = NA,
main = 'AD data - Scaled')
In the heatmap, samples from the batch dated “14/04/2016” clustered
together and were distinct from the other samples, indicating a clear
batch effect.
pRDA
To quantitatively evaluate batch effects, we can apply pRDA with
varpart() function from vegan R package.
# AD data
ad.factors.df <- data.frame(trt = ad.trt, batch = ad.batch)
head(ad.factors.df)
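The partition itself is not shown; a sketch of the varpart() call that produces the proportions discussed below:

```r
# pRDA: partition variance between treatment (X1) and batch (X2)
ad.rda.before <- varpart(ad.clr, ~ trt, ~ batch,
                         data = ad.factors.df, scale = TRUE)
# adjusted R-squared for fractions [a], [b], [c] and residuals
ad.rda.before$part$indfract
```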
In the result, X1 and X2 represent the
first and second covariates fitted in the model. [a] and
[b] indicate the independent proportions of variance
explained by X1 and X2, respectively, while
[c] represents the shared variance between X1
and X2. In the AD data, the variance
explained by batch (X2) was larger than that explained by
treatment (X1), with some intersection (shared variance)
reflected in [c] (Adj.R.squared = 0.013). A larger shared
variance would indicate a more unbalanced batch × treatment design;
since it is small here, we considered the design to be approximately
balanced.
Managing batch effects
Accounting for batch effects
The methods we use to account for batch effects include those
specifically designed for microbiome data, such as the zero-inflated
Gaussian (ZIG) mixture model (see the section “To go further”), as well
as methods adapted for microbiome data, including linear regression, SVA
(see “To go further”) and RUV4. Among these, SVA and RUV4 are designed
to handle unknown batch effects.
Linear regression
Linear regression will be conducted using
linear_regres() function in PLSDAbatch. We
integrated the performance package, which assesses the
performance of regression models, into function
linear_regres(). Therefore, we can apply
check_model() from performance to the outputs of
linear_regres() to diagnose the validity of models fitted
with treatment and batch effects for each variable (Lüdecke et al. 2020).
We can also extract performance metrics such as adjusted \(R^2\), RMSE,
RSE, AIC and BIC for models fitted with and without batch effects, which
are included in the outputs of linear_regres().
Let’s apply type = "linear model" to the AD
data, given its balanced batch x treatment design.
# AD data
# ad.clr <- ad.clr[seq_len(nrow(ad.clr)), seq_len(ncol(ad.clr))]
ad.lm <- linear_regres(data = ad.clr,
trt = ad.trt,
batch.fix = ad.batch,
type = 'linear model',
p.adjust.method = 'fdr')
# p values adjusted for batch effects
ad.p.adj <- ad.lm$adj.p
check_model(ad.lm$model$OTU12)
To assess the validity of the model fitted with both treatment and
batch effects, we can use diagnostic plots to check whether the
assumptions are met for each microbial variable. For example, for
“OTU12”, the assumptions of linearity and homogeneity of variance
(homoscedasticity) were not satisfied (top panel). No sample was
classified as an outlier with a Cook’s distance greater than or equal to
0.9; however, samples “57”, “39”, “47”, “44” and “16” were relatively
close to the contour lines. The collinearity between batch
(batch.fix) and treatment (trt) effects was
very low (VIF < 5), indicating a well-fitted model with low collinearity
(middle panel). The distribution of residuals was very close to normal
(bottom panel). For microbial variables that violate some model
assumptions, their results should be interpreted with caution.
For the performance metrics of models fitted with or without batch
effects, we show results for a subset of variables as an example
only.
head(ad.lm$adj.R2)
The adjusted \(R^2\) of the model
with both treatment and batch effects was higher for all the listed OTUs
compared to the model with treatment effects only, suggesting that
including batch effects explained more variance in the data and resulted
in a better-fitting model.
We can also compare the AIC of models fitted with and without batch
effects.
head(ad.lm$AIC)
A lower AIC indicates a better fit, as seen here for the model fitted
with batch effects across all OTUs.
Both results strongly suggest that batch effects should be included
in the linear model.
RUV4
Before applying RUV4 (RUV4() from ruv package),
we need to specify negative control variables and the number of batch
factors to estimate.
Empirical negative controls that are not significantly differentially
abundant (adjusted P > 0.05), based on a linear regression using
treatment information as the only covariate, can be used here.
Therefore, we will use a loop to fit a linear regression for each
microbial variable and adjust the P values of the treatment effects for
multiple comparisons using p.adjust() from stats.
The empirical negative controls can then be identified based on the
adjusted P values.
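The loop described above can be sketched as follows (object names are assumptions):

```r
# fit one linear model per OTU with treatment as the only covariate
ad.trt.p <- apply(ad.clr, 2, function(x){
    summary(lm(x ~ ad.trt))$coefficients[2, 4]  # P value of the trt effect
})
# adjust for multiple comparisons
ad.trt.p.adj <- p.adjust(ad.trt.p, method = 'fdr')
# empirical negative controls: not significantly differentially abundant
ad.nc <- ad.trt.p.adj > 0.05
```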
The number of batch factors k can be determined using
getK() function.
# estimate k
ad.k.res <- getK(Y = ad.clr, X = ad.trt, ctl = ad.nc)
ad.k <- ad.k.res$k
After all required parameters are estimated, let’s apply
RUV4() with the known treatment variables, the selected
negative control variables, and the estimated number of batch factors
k. The resulting P values should also be adjusted for
multiple comparisons.
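A sketch of the RUV4 application, assuming the negative controls (ad.nc) and k estimated above:

```r
# RUV4 with known treatment, negative controls and k batch factors
ad.ruv4 <- RUV4(Y = ad.clr, X = ad.trt, ctl = ad.nc, k = ad.k)
# adjust the resulting P values for multiple comparisons
ad.ruv4.p.adj <- p.adjust(ad.ruv4$p, method = 'fdr')
```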
Note: A package named RUVSeq has been developed for count
data. It provides RUVg() which uses negative control
variables, as well as other functions such as RUVs() and
RUVr(), which use sample replicates (Moskovicz et al. 2020) or residuals from
regression on treatment effects to estimate and account for latent batch
effects. However, for CLR-transformed data, we still recommend using the
standard ruv package.
Another method, SVA, which accounts for unknown batch effects, can be
found in the section “To go further”.
Correcting for batch effects
The methods we will use to correct for batch effects include ComBat,
PLSDA-batch and RUVIII. Among these, RUVIII is designed to correct for
unknown batch effects. Other methods, such as removeBatchEffect,
sPLSDA-batch and percentile normalisation, can be found in the section
“To go further”.
ComBat
The ComBat() function (from sva package)
supports both parametric and non-parametric correction, controlled by
the option par.prior. For parametric adjustment, the
model’s validity can be assessed by setting prior.plots = TRUE (Leek et al. 2012).
For AD data, we apply a non-parametric correction
(par.prior = FALSE), using the input batch grouping
information (batch) and the treatment design matrix
(mod) to calculate the batch effect corrected data
ad.ComBat. We chose the non-parametric approach based on
the assumption that microbial abundance data, even after CLR
transformation, do not follow a standard distribution.
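The correction call is not shown above; a sketch (the design matrix name is an assumption):

```r
# treatment design matrix to preserve the effect of interest
ad.trt.mod <- model.matrix(~ ad.trt)
# ComBat expects variables in rows, hence the transpositions
ad.ComBat <- t(ComBat(dat = t(ad.clr), batch = ad.batch,
                      mod = ad.trt.mod, par.prior = FALSE))
```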
Found 5 batches
Adjusting for 1 covariate(s) or covariate level(s)
Standardizing Data across genes
Fitting L/S model and finding priors
Finding nonparametric adjustments
Adjusting the Data
PLSDA-batch
Before applying PLSDA-batch, we need to specify the optimal number of
components related to treatment (ncomp.trt) and batch
effects (ncomp.bat).
To determine ncomp.trt, we will use the
plsda() from mixOmics with only the treatment
grouping information to estimate the optimal number of treatment-related
components to retain.
# estimate the number of treatment components
ad.trt.tune <- plsda(X = ad.clr, Y = ad.trt, ncomp = 5)
ad.trt.tune$prop_expl_var #1
We should choose the number of components that explain 100% of the
variance in the outcome matrix Y. From the result, 1
component was sufficient to preserve the treatment information.
To determine ncomp.bat, we will use the
PLSDA_batch() function (PLSDAbatch package) with
both treatment and batch grouping information, along with the specified
number of treatment-related components, to estimate the optimal number
of batch components to remove.
# estimate the number of batch components
ad.batch.tune <- PLSDA_batch(X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, ncomp.bat = 10)
ad.batch.tune$explained_variance.bat
Using the same criterion as for selecting treatment components, we
will choose the number of batch-related components that explain 100% of
the variance in the batch outcome matrix (Y.bat). According
to the result, 4 components are required to remove the batch
effects.
Then let’s correct for batch effects by applying
PLSDA_batch() with the input treatment and batch grouping
information, along with the estimated optimal numbers of related
components.
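A sketch of the correction step, using the component numbers estimated above (the output slot name follows the PLSDAbatch convention):

```r
# correct for batch effects with the tuned numbers of components
ad.PLSDA_batch.res <- PLSDA_batch(X = ad.clr,
                                  Y.trt = ad.trt, Y.bat = ad.batch,
                                  ncomp.trt = 1, ncomp.bat = 4)
# batch effect corrected data
ad.PLSDA_batch <- ad.PLSDA_batch.res$X.nobatch
```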
Note: Within the PLSDAbatch package, PLSDA-batch is
more suitable for weak batch effects, sparse PLSDA-batch is better
suited for strong batch effects (see section “To go further”), and
weighted PLSDA-batch is specifically designed for unbalanced, but not
nested, batch x treatment designs.
RUVIII
The RUVIII() function is from ruv package.
Similar to RUV4(), it requires empirical negative control
variables and the number of unwanted factors (k) to remove.
We will use those estimated in the RUV4 section. In addition, it
requires sample replicates, which should be structured into a mapping
matrix using replicate.matrix(). With these elements, we
can then obtain the batch effect corrected data by applying
RUVIII().
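A sketch of the RUVIII correction, reusing ad.nc and ad.k from the RUV4 section; the metadata column holding the replicate identifiers is hypothetical and should be adjusted to the actual data:

```r
# replicate structure (column name is an assumption)
ad.replicates <- ad.metadata$reactor  # hypothetical replicate identifier
ad.M <- replicate.matrix(ad.replicates)
# batch effect corrected data
ad.RUVIII <- RUVIII(Y = ad.clr, M = ad.M, ctl = ad.nc, k = ad.k)
rownames(ad.RUVIII) <- rownames(ad.clr)
```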
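The PCA objects plotted below are not computed in the excerpt above; a sketch, assuming the corrected datasets from the previous sections:

```r
# PCA on each batch effect corrected dataset for visual comparison
ad.pca.ComBat <- pca(ad.ComBat, ncomp = 3, scale = TRUE)
ad.pca.PLSDA_batch <- pca(ad.PLSDA_batch, ncomp = 3, scale = TRUE)
ad.pca.RUVIII <- pca(ad.RUVIII, ncomp = 3, scale = TRUE)
```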
ad.pca.before.plot <- Scatter_Density(object = ad.pca.before,
batch = ad.batch,
trt = ad.trt,
title = 'Before correction')
ad.pca.ComBat.plot <- Scatter_Density(object = ad.pca.ComBat,
batch = ad.batch,
trt = ad.trt,
title = 'ComBat')
ad.pca.PLSDA_batch.plot <- Scatter_Density(object = ad.pca.PLSDA_batch,
batch = ad.batch,
trt = ad.trt,
title = 'PLSDA-batch')
ad.pca.RUVIII.plot <- Scatter_Density(object = ad.pca.RUVIII,
batch = ad.batch,
trt = ad.trt,
title = 'RUVIII')
Exercise 4: Interpret the PCA plots created above
Answer
As shown in the plots, the differences between samples sequenced on
“14/04/2016”, “21/09/2017” and the other dates were removed after batch
effect correction by all methods. Among these methods, the data
corrected with PLSDA-batch retained more treatment-related variation
than the data corrected by other methods, primarily on the first PC, as
indicated by the x-axis label (26%).
We can also compare boxplots and density plots of key variables
identified by PCA, as well as heatmaps that highlight distinct patterns
before and after batch effect correction. However, due to time
limitations, we will not show the results here.
pRDA
As a quantitative measure, we can calculate the global explained
variance across all microbial variables using pRDA. The results can be
visualised using partVar_plot() from PLSDAbatch
package.
# arrange data before and after batch effect correction into a list
ad.corrected.list <- list(`Before correction` = ad.clr,
ComBat = ad.ComBat,
`PLSDA-batch` = ad.PLSDA_batch,
RUVIII = ad.RUVIII)
ad.prop.df <- data.frame(Treatment = NA, Batch = NA,
Intersection = NA,
Residuals = NA)
# run rda in a loop
for(i in seq_len(length(ad.corrected.list))){
rda.res = varpart(ad.corrected.list[[i]], ~ trt, ~ batch,
data = ad.factors.df, scale = TRUE)
ad.prop.df[i, ] <- rda.res$part$indfract$Adj.R.squared}
rownames(ad.prop.df) = names(ad.corrected.list)
# change the order of explained variance
ad.prop.df <- ad.prop.df[, c(1,3,2,4)]
# remove values less than zero, and recalculate the proportions
ad.prop.df[ad.prop.df < 0] = 0
ad.prop.df <- as.data.frame(t(apply(ad.prop.df, 1,
function(x){x/sum(x)})))
partVar_plot(prop.df = ad.prop.df)
As shown in the figure above, the intersection between batch and
treatment variance was small (1.3%), indicating that the batch x
treatment design is not highly unbalanced. As a result, unweighted
PLSDA-batch remained applicable, and the weighted version was not used.
Among all methods, the data corrected with PLSDA-batch showed the best
performance, explaining a higher proportion of treatment variance
compared to the others. Notably, replicates in AD data
are not present across all batches, highlighting the critical role of
sample replicates in the effectiveness of RUVIII. Therefore, we
calculated pseudo replicates and recalculated the batch effect corrected
data. The result showed clear improvement, which can be found in the
section “To go further”.
Other methods
\(\mathbf{R^2}\)
We can also calculate the \(R^2\)
from one-way ANOVA for each variable to evaluate the proportion of
explained variance, using lm() from stats package.
To ensure comparability of \(R^2\)
values across variables, we will scale the corrected data prior to the
\(R^2\) calculation. The results will
then be visualised by ggplot() from ggplot2 R
package.
# AD data
# scale
ad.corr_scale.list <- lapply(ad.corrected.list,
function(x){apply(x, 2, scale)})
# perform one-way ANOVA on each variable within each dataset
ad.r_values.list <- list()
for(i in seq_len(length(ad.corr_scale.list))){
# for each dataset
ad.r_values <- data.frame(trt = NA, batch = NA)
for(c in seq_len(ncol(ad.corr_scale.list[[i]]))){
# for each variable
ad.fit.res.trt <- lm(ad.corr_scale.list[[i]][,c] ~ ad.trt)
ad.r_values[c,1] <- summary(ad.fit.res.trt)$r.squared
ad.fit.res.batch <- lm(ad.corr_scale.list[[i]][,c] ~ ad.batch)
ad.r_values[c,2] <- summary(ad.fit.res.batch)$r.squared
}
ad.r_values.list[[i]] <- ad.r_values
}
names(ad.r_values.list) <- names(ad.corr_scale.list)
head(ad.r_values.list$`Before correction`)
# generate boxplots for each dataset
ad.boxp.list <- list()
for(i in seq_len(length(ad.r_values.list))){
ad.boxp.list[[i]] <-
data.frame(r2 = c(ad.r_values.list[[i]][ ,'trt'],
ad.r_values.list[[i]][ ,'batch']),
Effects = as.factor(rep(c('Treatment','Batch'),
each = 231)))
}
names(ad.boxp.list) <- names(ad.r_values.list)
head(ad.boxp.list$`Before correction`)
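The plotting step itself is not shown; a sketch that stacks the per-method data frames and draws the boxplots with ggplot2 (231 OTUs x 2 effects = 462 rows per method):

```r
# combine the per-method data frames and label each with its method
ad.r2.boxp <- do.call(rbind, ad.boxp.list)
ad.r2.boxp$methods <- factor(rep(names(ad.boxp.list), each = 462),
                             levels = names(ad.boxp.list))
# boxplots of R2 values per effect, one panel per method
ggplot(ad.r2.boxp, aes(x = Effects, y = r2, fill = Effects)) +
    geom_boxplot() +
    facet_grid(~ methods) +
    labs(y = expression(R^2)) +
    theme_bw()
```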
As shown in the plots, the data corrected using ComBat still
contained a few variables with a large proportion of batch variance. In
the case of RUVIII, the corrected data exhibited a greater proportion of
batch variance than treatment variance.
Alignment scores
Before applying alignment_score() function from
PLSDAbatch, we need to specify the proportion of data variance
to explain (var), the number of nearest neighbours
(k) and the number of PCs to calculate (ncomp)
for the internal PCA. We can then use ggplot() function
from ggplot2 to visualise the results.
# calculate the alignment scores
ad.scores <- c()
names(ad.batch) <- rownames(ad.clr)
for(i in seq_len(length(ad.corrected.list))){
res <- alignment_score(data = ad.corrected.list[[i]],
batch = ad.batch,
var = 0.95,
k = 8,
ncomp = 50)
ad.scores <- c(ad.scores, res)
}
head(ad.scores)
[1] 0.3766892 0.7179054 0.6689189 0.5304054
# rearrange the data for ggplot
ad.scores.df <- data.frame(scores = ad.scores,
methods = names(ad.corrected.list))
ad.scores.df$methods <- factor(ad.scores.df$methods,
levels = rev(names(ad.corrected.list)))
head(ad.scores.df)
The alignment scores complement the PCA results, especially when
batch effect removal is difficult to evaluate from PCA sample plots
alone. For example, in the previous PCA plots (see section “PCA”), we
observed that samples from different batches appeared more intermixed
after batch effect correction, regardless of the method used. However,
comparing the performance of different methods remained challenging.
Since a higher alignment score indicates better mixing of samples
across batches, the bar plot above shows that ComBat achieved superior
performance compared to the other methods. However, the \(R^2\) analysis revealed that the data
corrected with ComBat still contained a few variables with a large
proportion of batch variance (see section “\(R^2\)”). This highlights the importance of
evaluating correction effectiveness using multiple techniques to ensure
an unbiased assessment.
In this example, the lower alignment score observed for the data
corrected using PLSDA-batch compared to ComBat may be due to differences
in the PCA sample projections. Specifically, the data corrected with
ComBat exhibited greater variance in its PCA projection, whereas the
data with PLSDA-batch showed smaller variance. Smaller variance in the
projection can lead to lower alignment scores, as it increases the
likelihood that samples from different batches appear as nearest
neighbours. Nonetheless, the pRDA results (see section “pRDA”)
quantitatively confirmed that PLSDA-batch effectively removed batch
variance entirely (Wang and Lê Cao
2023).
Discussion and conclusions
This workshop presents a comprehensive framework for managing batch
effects, using microbiome data as an illustrative example. The input for
this workflow is preprocessed data. For different types of omics data,
the preprocessing steps may vary.
To detect batch effects, both visual tools (e.g., PCA, boxplots,
density plots, and heatmaps) and quantitative methods such as pRDA can
be applied. If batch effects appear negligible, for instance, when pRDA
shows only a minimal proportion of variance explained by batch or when
PCA does not reveal clear batch-driven clusters, batch effect management
may not be necessary. However, when batch effects are substantial, two
strategies can be considered: accounting for batch effects during
modeling, or removing them from the data prior to analysis.
Both strategies assume a balanced batch × treatment design. If the
design is nested, batch effects can only be accounted for using a linear
mixed model. If the design is unbalanced but not nested, batch effects
can either be controlled using standard linear models or removed using
weighted PLSDA-batch.
In most cases, batch grouping information is assumed to be known.
When it is not, batch estimation methods such as RUV4, RUVIII and SVA
can be employed. RUV-based methods rely on negative control variables
and/or sample replicates, while SVA estimates batch effects from
variables that are least affected by treatment (see section “To go
further”). However, for RUV methods, these negative controls and sample
replicates must capture the full spectrum of batch variation; otherwise,
batch effects may not be completely addressed. We emphasised this issue
in the section “Assessing batch effect correction - pRDA”, where limited
replicates posed challenges for batch effect removal. Moreover, these
estimation methods assume that batch effects are independent of
treatment. When batch and treatment are correlated, the batch effect
cannot be accurately estimated and may lead to spurious
associations.
Additionally, most methods for batch effect management assume
systematic batch effects across variables, and some models are highly
influenced by this assumption (e.g., SVA, ComBat and RUV methods).
Therefore, we recommend validating the model before applying it.
The next critical step is to evaluate the effectiveness of batch
effect management. While such evaluation is often implicit in methods
that account for batch effects, it is essential for methods that remove
them. This can be done by comparing the data before and after correction
using the same tools employed for batch effect detection. Additionally,
\(R^2\) values can be calculated for
each microbial variable to quantify the variance explained by batch and
treatment, and alignment scores can be used to assess how well samples
from different batches are mixed. However, individual evaluation tools
may have limitations. For example, PCA relies on visual interpretation,
while alignment scores focus solely on distances in the PCA projection
space. Therefore, a robust conclusion should be based on multiple
complementary evaluation methods.
Once batch effects have been adequately addressed, downstream
analyses, such as multivariate discriminant analysis or univariate
differential analysis, can be performed. Among these, multivariate
methods are often more appropriate for microbiome data, given the
natural correlations among microbial variables arising from biological
interactions.
To go further
Accounting for batch effects
Methods designed for microbiome data
Zero-inflated Gaussian mixture model
To use the ZIG model, we first need to create an
MRexperiment object by applying
newMRexperiment() (from metagenomeSeq package) to
microbiome counts and annotated data frames containing metadata and
taxonomic information generated using AnnotatedDataFrame()
from Biobase package.
# Creating a MRexperiment object (make sure no NA in metadata)
AD.phenotypeData = AnnotatedDataFrame(data = as.data.frame(ad.metadata))
AD.taxaData = AnnotatedDataFrame(data = as.data.frame(colData(AD_data$FullData)))
AD.obj = newMRexperiment(counts = t(ad.count),
phenoData = AD.phenotypeData,
featureData = AD.taxaData)
AD.obj
The AD count data are then filtered with
filterData() function (from metagenomeSeq
package). We can use MRcounts() to extract the count data
from the MRexperiment object.
# filtering data to maintain a threshold of minimum depth or OTU presence
dim(MRcounts(AD.obj))
AD.obj = filterData(obj = AD.obj, present = 20, depth = 5)
dim(MRcounts(AD.obj))
After filtering, the AD count data were reduced to
289 OTUs and 75 samples.
We will then calculate the percentile for CSS normalisation with
cumNormStatFast() function (from metagenomeSeq
package). The CSS normalisation can be applied with
cumNorm() and the normalised data can be exported using
MRcounts() with norm = TRUE. The normalisation
scaling factors for each sample, which are the sum of counts up to the
calculated percentile, can be accessed through
normFactors(). We can calculate the log-transformed scaling
factors by dividing them by their median, which performs better than the
default scaling factors divided by 1000
(log2(normFactors(obj)/1000 + 1)).
# calculate the percentile for CSS normalisation
AD.pctl = cumNormStatFast(obj = AD.obj)
# CSS normalisation
AD.obj <- cumNorm(obj = AD.obj, p = AD.pctl)
# export normalised data
AD.norm.data <- MRcounts(obj = AD.obj, norm = TRUE)
# normalisation scaling factors for each sample
AD.normFactor = normFactors(object = AD.obj)
AD.normFactor = log2(AD.normFactor/median(AD.normFactor) + 1)
After that, let’s create a design matrix with treatment variable
(phenol_conc), batch variable (seq_run) and
the log transformed scaling factors using model.matrix(),
and then apply the ZIG model by fitZig() function. We
should set useCSSoffset = FALSE to avoid using the default
scaling factors as we have already included our customised scaling
factor (AD.normFactor) in the design matrix.
# treatment variable
phenol_conc = pData(object = AD.obj)$initial_phenol_concentration.regroup
# batch variable
seq_run = pData(object = AD.obj)$sequencing_run_date
# build a design matrix
AD.mod.full = model.matrix(~ phenol_conc + seq_run + AD.normFactor)
# settings for the fitZig() function
AD.settings <- zigControl(maxit = 10, verbose = TRUE)
# apply the ZIG model
ADfit <- fitZig(obj = AD.obj, mod = AD.mod.full,
useCSSoffset = FALSE, control = AD.settings)
The OTUs with the 50 smallest P values can be extracted using
MRcoefs(). Let’s set eff = 0.5, so that only
OTUs with at least the 0.5 quantile (50%) of effective samples
(positive samples + estimated undersampling zeroes) are extracted.
ADcoefs <- MRcoefs(ADfit, coef = 2, group = 3, number = 50, eff = 0.5)
head(ADcoefs)
Methods adapted for microbiome data
SVA
SVA only accounts for unknown batch effects. Therefore, we assume
that the batch grouping information in the AD data is
unknown. We first need to build two design matrices with
(ad.mod) and without (ad.mod0) treatment
grouping information generated with model.matrix() function
from stats. We then need to use num.sv() from
sva package to determine the number of batch variables
n.sv that is used to estimate batch effects in function
sva().
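The estimation steps described above can be sketched as follows (object names are assumptions):

```r
# design matrices with and without the treatment effect
ad.mod <- model.matrix(~ ad.trt)
ad.mod0 <- model.matrix(~ 1, data = data.frame(ad.trt))
# number of surrogate (batch) variables, then estimate them
ad.n.sv <- num.sv(dat = t(ad.clr), mod = ad.mod)
ad.sva <- sva(dat = t(ad.clr), mod = ad.mod, mod0 = ad.mod0,
              n.sv = ad.n.sv)
```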
Number of significant surrogate variables is: 4
Iteration (out of 5 ):1 2 3 4 5
The estimated batch effects are then input into
f.pvalue() to calculate the P-values of treatment effects.
The estimated batch effects in SVA are assumed to be independent of the
treatment effects. However, SVA can tolerate some sort of correlation
between batch and treatment effects, but only to a limited extent (Wang and Lê Cao 2020).
# include estimated batch effects in the linear model
ad.mod.batch <- cbind(ad.mod, ad.sva$sv)
ad.mod0.batch <- cbind(ad.mod0, ad.sva$sv)
ad.sva.p <- f.pvalue(t(ad.clr), ad.mod.batch, ad.mod0.batch)
ad.sva.p.adj <- p.adjust(ad.sva.p, method = 'fdr')
Correcting for batch effects
removeBatchEffect
Before applying removeBatchEffect, we need to prepare a design matrix
(design) that includes the treatment grouping information,
which will be preserved during batch effect correction. This design
matrix can be generated using model.matrix() function from
stats.
Then we can use removeBatchEffect() function
(limma package) with input batch grouping information
(batch) and treatment design matrix (design)
to calculate batch effect corrected data ad.rBE.
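A sketch of the two steps just described:

```r
# design matrix encoding the treatment effect to preserve
ad.design <- model.matrix(~ ad.trt)
# limma expects variables in rows, hence the transpositions
ad.rBE <- t(removeBatchEffect(t(ad.clr), batch = ad.batch,
                              design = ad.design))
```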
sPLSDA-batch
To apply sPLSDA-batch, we will use the same function
PLSDA_batch(), but we need to specify the number of
variables to select on each component (usually only treatment-related
components keepX.trt). To determine the optimal number of
variables to select, we will use tune.splsda() function
from mixOmics package (Rohart et al.
2017) with all possible numbers of variables to select for each
component (test.keepX).
# estimate the number of variables to select per treatment component
set.seed(777)
ad.test.keepX = c(seq(1, 10, 1), seq(20, 100, 10),
seq(150, 231, 50), 231)
ad.trt.tune.v <- tune.splsda(X = ad.clr, Y = ad.trt,
ncomp = 1, test.keepX = ad.test.keepX,
validation = 'Mfold', folds = 4,
nrepeat = 50)
ad.trt.tune.v$choice.keepX #100
Here, the optimal number of variables to select for the treatment
component was 100. Since we have adjusted the amount of treatment
variation to preserve, we need to re-estimate the optimal number of
components related to batch effects using the same criterion mentioned
in section “PLSDA-batch”.
# estimate the number of batch components
ad.batch.tune <- PLSDA_batch(X = ad.clr,
Y.trt = ad.trt, Y.bat = ad.batch,
ncomp.trt = 1, keepX.trt = 100,
ncomp.bat = 10)
ad.batch.tune$explained_variance.bat #4
Note: for unbalanced batch x treatment design (with the exception of
the nested design), we can specify balance = FALSE in
PLSDA_batch() function to apply weighted PLSDA-batch.
Percentile Normalisation
For PN correction, let’s apply percentile_norm()
function from PLSDAbatch package and specify a control group
(ctrl.grp).
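A sketch of the call; the control group must be a level of the treatment factor, and which level serves as control is an assumption here:

```r
# percentile normalisation against a control group
# (first treatment level used as a placeholder control)
ad.PN <- percentile_norm(data = ad.clr, batch = ad.batch,
                         trt = ad.trt, ctrl.grp = levels(ad.trt)[1])
```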
Chapleur, Olivier, Céline Madigou, Raphaël Civade, Yohan Rodolphe,
Laurent Mazéas, and Théodore Bouchez. 2016. “Increasing
Concentrations of Phenol Progressively Affect Anaerobic Digestion of
Cellulose and Associated Microbial Communities.” Biodegradation 27 (1): 15–27.
Leek, Jeffrey T, W Evan Johnson, Hilary S Parker, Andrew E Jaffe, and
John D Storey. 2012. “The Sva Package for Removing Batch Effects
and Other Unwanted Variation in High-Throughput Experiments.” Bioinformatics 28 (6): 882–83.
Lüdecke, Daniel, Dominique Makowski, Philip Waggoner, and Indrajeet
Patil. 2020. “Performance: Assessment of Regression Models
Performance.” CRAN. https://doi.org/10.5281/zenodo.3952174.
Moskovicz, Veronica, Rina Ben-El, Guy Horev, and Boaz Mizrahi. 2020.
“Skin Microbiota Dynamics Following B. Subtilis Formulation
Challenge.”
Rohart, Florian, Benoît Gautier, Amrit Singh, and Kim-Anh Lê Cao. 2017.
“mixOmics: An R Package for ‘Omics Feature Selection and Multiple
Data Integration.” PLoS Computational Biology 13 (11): e1005752.
Wang, Yiwen, and Kim-Anh Lê Cao. 2020. “Managing Batch Effects in
Microbiome Data.” Briefings in Bioinformatics 21 (6): 1954–70.
Wang, Yiwen, and Kim-Anh Lê Cao. 2023. “PLSDA-Batch: A
Multivariate Framework to Correct for Batch Effects in Microbiome
Data.” Briefings in Bioinformatics 24 (2): bbac622.